PatTrieSort - External String Sorting based on Patricia Tries
Abstract
External merge sort is among the most efficient and widely used algorithms for sorting big data: as much data as fits into main memory is sorted there and then written to external storage as a so-called initial run. After all the data has been sorted block-wise in this way, the initial runs are combined in a merge phase to produce the final sorted run, which contains the original data in fully sorted order. Patricia tries are one of the most space-efficient ways to store strings, especially strings with common prefixes. Hence, we propose using Patricia tries for initial run generation in an external merge sort variant, so that initial runs can be larger than in traditional external merge sort with the same amount of main memory. Furthermore, we store the initial runs as Patricia tries instead of lists of sorted strings. As we show in this paper, Patricia tries can be merged efficiently, with better performance than merging runs of sorted strings. We complete our discussion with a complexity analysis and a comprehensive performance evaluation, in which our new approach outperforms traditional external merge sort by a factor of 4 when sorting over 4 billion strings of real-world data.
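To illustrate why a Patricia-style trie suits run generation, the following is a minimal in-memory sketch, not the authors' implementation. It assumes a simple character-labelled compressed trie (a radix tree) rather than the bit-level Patricia layout, and the names `Node`, `insert`, `iter_sorted`, and `trie_sort` are hypothetical. Shared prefixes are stored once on an edge, and an in-order traversal yields the strings in sorted order, which is what a run writer would consume.

```python
class Node:
    """One node of a compressed (radix / Patricia-style) trie over strings."""

    def __init__(self):
        self.children = {}  # first character -> (edge label, child Node)
        self.count = 0      # number of inserted strings that end at this node


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def insert(root, key):
    """Insert key, splitting an edge where key diverges from its label."""
    node = root
    while True:
        if not key:
            node.count += 1  # key ends here (duplicates are counted)
            return
        entry = node.children.get(key[0])
        if entry is None:
            leaf = Node()
            leaf.count = 1
            node.children[key[0]] = (key, leaf)
            return
        label, child = entry
        k = common_prefix_len(label, key)
        if k < len(label):
            # Split the edge: the shared prefix is kept once, above both branches.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[key[0]] = (label[:k], mid)
            child = mid
        node = child
        key = key[k:]


def iter_sorted(node, prefix=""):
    """Yield all stored strings in lexicographic (code point) order."""
    for _ in range(node.count):
        yield prefix
    for first in sorted(node.children):
        label, child = node.children[first]
        yield from iter_sorted(child, prefix + label)


def trie_sort(strings):
    """Sort strings by inserting them into a trie and traversing it in order."""
    root = Node()
    for s in strings:
        insert(root, s)
    return list(iter_sorted(root))


if __name__ == "__main__":
    print(trie_sort(["romanus", "romane", "rubens", "romulus", "ruber"]))
```

In an external setting, one would fill such a trie until main memory is exhausted and then write it out as an initial run. How the paper serializes and merges tries on external storage is not shown here.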
Similar resources
An Adaptive Algorithm for Splitting Large Sets of Strings and Its Application to Efficient External Sorting
In this paper, we study the problem of sorting a large collection of strings in external memory. Based on the adaptive construction of a summary data structure, called an adaptive synopsis trie, we present a practical string sorting algorithm, DistStrSort, which is suitable for sorting large string collections in external memory, and also suitable for more complex string processing problems in t...
Implementation and Evaluation of String B-Tree
String B-tree is a combination of B-tree and Patricia tries for internal-node indices. Instead of storing prefix compressed keys at each index node, each key is stored in full in a consecutive sequence of data blocks, and each downward-traversal decision is made by a combination of Patricia trie search and the consultation of a single key. String B-tree has the same worst case performance as B-...
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Laws of Large Numbers and Tail Inequalities for Random Tries and Patricia Trees
We consider random tries and random patricia trees constructed from n independent strings of symbols drawn from any distribution on any discrete space. If Hn is the height of this tree, we show that Hn/E{Hn} tends to one in probability. Additional tail inequalities are given for the height, depth, size, and profile of these trees and ordinary tries that apply without any conditions on the strin...
Implementation and Evaluation of an External Memory String B-Tree
Preprocessing texts of huge size to answer substring queries is not trivial whenever considering realistic models. We approach this problem by offering an efficient implementation of the String B-Tree data structure, which aims to solve the substring search problem under the dynamic operations. We achieve optimal space usage for the Patricia Tries by representing them via multiarray encoding an...
Journal title:
- OJDB
Volume 2, Issue
Pages -
Publication date: 2015